78.3% of observations in the dataset have no defaulted loan, while 21.7% have a defaulted loan. The dataset is slightly imbalanced.

Categorical Features Analysis

Statistical Hypothesis Techniques

The main focus of hypothesis testing is to draw relations and infer insights between loan default and the different features in this dataset (df).

Test of Normality

Normality: the population data should follow a normal distribution (by default, alpha = 5%).

H0: pop data = Normal

H1: pop data != normal

Note: if p-value >= alpha, we fail to reject H0 and treat the pop data as Normal.

Since p-value < alpha (0.05), reject H0.

Hence we can conclude that the pop data of the LTV ratio does not follow a normal distribution.

So we cannot use a parametric test and proceed with a non-parametric test.
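The notes do not say which normality test was run. As a minimal sketch, assuming a Shapiro-Wilk test from SciPy and a synthetic, left-skewed stand-in for the LTV column (the `ltv` array below is hypothetical, not the real data), the decision rule above could look like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-in for the LTV column: skewed values bounded in (0, 100).
ltv = 100 * rng.beta(8, 2, size=2000)

alpha = 0.05
# H0: the data come from a normal distribution.
stat, p_value = stats.shapiro(ltv)

reject_h0 = p_value < alpha
print(f"W = {stat:.4f}, p-value = {p_value:.3g}, reject H0: {reject_h0}")
```

Shapiro-Wilk is best suited to samples of a few thousand rows at most, so on a large dataset it may be reasonable to test a random sample or to use a different normality test.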

Test of Median

H0: pop median LTV >= 50

H1: pop median LTV < 50

WilcoxonResult(statistic=570401111.5, pvalue=1.0)

Since the p-value (1.0) is greater than alpha (5%), we fail to reject H0.

Inference: There is no evidence that the pop median of Loan to Value (LTV) is below 50; the data are consistent with a median of at least 50.
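The call that produced the `WilcoxonResult` is not shown in these notes. A minimal sketch, assuming `scipy.stats.wilcoxon` applied to the values shifted by the hypothesised median with a one-sided alternative, and using a synthetic stand-in for the LTV column (the `ltv` array is hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical stand-in for the LTV column; its median is well above 50.
ltv = 100 * rng.beta(8, 2, size=2000)

alpha = 0.05
hypothesised_median = 50

# H0: median >= 50, H1: median < 50, so the alternative is "less".
# The signed-rank test is run on the differences from the hypothesised median.
stat, p_value = stats.wilcoxon(ltv - hypothesised_median, alternative="less")

reject_h0 = p_value < alpha
print(f"statistic = {stat}, p-value = {p_value:.3g}, reject H0: {reject_h0}")
```

When most values lie above 50, the one-sided p-value is close to 1, which matches the pattern reported above.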

2 Independent Samples test

Problem: Test whether the LTV value is the same across Employment_Type groups.
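The notes do not record which test was used here or its result. Since LTV was found to be non-normal, one non-parametric option is the Mann-Whitney U test. The sketch below assumes that choice and uses two synthetic groups (`ltv_salaried` and `ltv_self_employed` are hypothetical stand-ins, not the real data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical LTV samples for the two employment types.
ltv_salaried = 100 * rng.beta(8, 2, size=1000)
ltv_self_employed = 100 * rng.beta(6, 2, size=1000)

alpha = 0.05
# H0: both groups have the same LTV distribution.
# H1: the distributions differ (two-sided).
stat, p_value = stats.mannwhitneyu(
    ltv_salaried, ltv_self_employed, alternative="two-sided"
)

reject_h0 = p_value < alpha
print(f"U = {stat}, p-value = {p_value:.3g}, reject H0: {reject_h0}")
```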

Test of Proportion

H0: P = 0.3 : The loan default proportion is 30%

H1: P != 0.3 : The loan default proportion is not 30%

Since p-value < alpha (0.05), reject H0. The loan default rate is not 30%.
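As a rough illustration of this test, a one-sample z-test for a proportion can be written with the standard library alone. The 21.7% default rate comes from the notes above; the sample size of 10,000 is an assumption made only for this example, and the function name is hypothetical:

```python
import math

def one_proportion_ztest(successes, n, p0):
    """Two-sided z-test of H0: P = p0, using the normal approximation."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)           # standard error under H0
    z = (p_hat - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))
    return z, p_value

alpha = 0.05
n = 10_000        # assumed sample size, for illustration only
defaults = 2_170  # 21.7% of n

z, p_value = one_proportion_ztest(defaults, n, p0=0.3)
reject_h0 = p_value < alpha
print(f"z = {z:.3f}, p-value = {p_value:.3g}, reject H0: {reject_h0}")
```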

Two samples Test of Proportion

Test whether the loan default proportion differs by Employment type.

H0: P_Salaried = P_Self Employed; P_Salaried - P_Self Employed = 0 : There is no difference in loan default proportion

H1: P_Salaried != P_Self Employed; P_Salaried - P_Self Employed != 0 : There is a difference in loan default proportion

Since p-value (0.0) < alpha (0.05), reject H0.

We can conclude that there is a difference in the loan default proportion between Salaried and Self Employed borrowers.
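The two-sample version can be sketched the same way, using a pooled proportion under H0. All counts below are made up for illustration and the function name is hypothetical; the real group sizes and default counts are not given in these notes:

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test of H0: P1 = P2 with a pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))
    return z, p_value

alpha = 0.05
# Hypothetical counts: defaults and totals per employment type.
salaried_defaults, salaried_total = 2_000, 10_000
self_emp_defaults, self_emp_total = 2_300, 10_000

z, p_value = two_proportion_ztest(
    salaried_defaults, salaried_total, self_emp_defaults, self_emp_total
)
reject_h0 = p_value < alpha
print(f"z = {z:.3f}, p-value = {p_value:.3g}, reject H0: {reject_h0}")
```

With very large samples, even a small difference in proportions yields a p-value that prints as 0.0, so it is worth reporting the size of the difference alongside the test result.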

Chi Square test

Is there any association between Loan Default and Region?

H0: There is no association between Loan Default and Region (they are independent)

H1: There is an association between Loan Default and Region (they are dependent)

Chi-Square Value Analysis:

chi_square_stat > chi_square_critical, so reject H0

P-Value Analysis:

p-value < alpha (0.05), so reject H0

Hence we can conclude that there is an association between Loan Default and Region in this dataset.
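Both decision rules (critical value and p-value) can be checked together. The sketch below assumes `scipy.stats.chi2_contingency` and a made-up 3x2 contingency table of region by default status; the real table would come from something like a cross-tabulation of the two columns in df:

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows are regions, columns are (no default, default).
observed = np.array([
    [800, 200],
    [700, 300],
    [780, 220],
])

alpha = 0.05
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

# Critical value of the chi-square distribution at the chosen alpha.
chi2_critical = stats.chi2.ppf(1 - alpha, dof)

reject_by_critical = chi2_stat > chi2_critical
reject_by_p_value = p_value < alpha
print(f"chi2 = {chi2_stat:.3f}, critical = {chi2_critical:.3f}, dof = {dof}")
print(f"p-value = {p_value:.3g}, reject H0: {reject_by_p_value}")
```

The two rules always agree, because the p-value falls below alpha exactly when the statistic exceeds the critical value.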

Machine Learning Model Building

Logistic Regression

Linear Regression

Random Forest Classifier

Modelling without SMOTE

Modelling with SMOTE

Inference

The problem statement asks us to calculate the likelihood of a borrower defaulting on a loan. So, in addition to predicting whether a person is a defaulter or not, we must also predict the probability that the person will default.

We therefore used the AUC score, the F1-score for class 1 (defaulters), and binary log loss as performance metrics to assess the models.
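For reference, these three metrics can be computed with scikit-learn as sketched below. The labels and predicted probabilities are a tiny made-up example, not output from any of the models in this notebook:

```python
import numpy as np
from sklearn.metrics import f1_score, log_loss, roc_auc_score

# Hypothetical true labels (1 = default) and predicted default probabilities.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.15, 0.90])

# Hard predictions at the default 0.5 threshold.
y_pred = (y_prob >= 0.5).astype(int)

auc = roc_auc_score(y_true, y_prob)             # uses probabilities
f1_pos = f1_score(y_true, y_pred, pos_label=1)  # F1 for class 1 only
bll = log_loss(y_true, y_prob)                  # binary log loss

print(f"AUC = {auc:.4f}, F1 (class 1) = {f1_pos:.4f}, log loss = {bll:.4f}")
```

AUC and log loss are computed from the predicted probabilities, while F1 depends on the threshold used to turn probabilities into class labels.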

All models give significantly lower F1-scores for class 1 when SMOTE is not used. Using SMOTE resolves this problem (though the F1-scores can also be controlled by selecting an appropriate threshold from the ROC curve).
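The notes do not say how a threshold would be picked from the ROC curve. One common heuristic is to maximise Youden's J statistic (TPR minus FPR); the sketch below assumes that rule and reuses a small made-up set of labels and probabilities:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_curve

# Hypothetical true labels (1 = default) and predicted default probabilities.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.15, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Youden's J: pick the threshold with the largest gap between TPR and FPR.
best_idx = np.argmax(tpr - fpr)
best_threshold = thresholds[best_idx]

f1_default = f1_score(y_true, (y_prob >= 0.5).astype(int), pos_label=1)
f1_tuned = f1_score(y_true, (y_prob >= best_threshold).astype(int), pos_label=1)

print(f"chosen threshold = {best_threshold:.2f}")
print(f"F1 at 0.5 = {f1_default:.4f}, F1 at chosen threshold = {f1_tuned:.4f}")
```

Lowering the threshold trades precision for recall on the minority class, which is why it can lift the class 1 F1-score without resampling.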

Looking at the performance metrics of the various models in the table above, we can see that Logistic Regression with SMOTE performs best overall. It produces good AUC scores (without overfitting) and the best F1-score for class 1. Its binary log loss, however, is slightly higher than that of the other models.

Random Forest Classifier with SMOTE is the next best model. Its AUC scores show that it overfits more than Logistic Regression. It does, however, have a good F1-score for class 1, only slightly lower than that of Logistic Regression, and a better (lower) binary log loss.